Database Read Cache Optimization

ABSTRACT

A system, method and apparatus for storing metadata in a metadata store in a robust and efficient manner including receiving a request from a client to perform a data transaction, updating a key-value pair in a metadata store based on the request, entering the data transaction in a transaction log, updating a read cache with the key-value pair, and replicating the last transaction log entry in at least one other storage node in the metadata store.

BACKGROUND

The present disclosure relates generally to managing transaction logs and read caches in a database. In particular, the present disclosure relates to efficiently managing, duplicating, and migrating transaction log and read cache data in a key-value store.

Metadata stores are one example of a key-value store that stores structural or descriptive information about other data (e.g., data stored in a large scale distributed storage system). In computer data storage systems, particularly large scale distributed storage systems, the metadata stored in a metadata store may contain information about the location or description of data stored in the large scale distributed storage system. Metadata is important in data storage systems for locating and maintaining data stored in the data storage system.

Further, if a storage node or storage device of a metadata store fails, the metadata including difficult-to-recreate transactions may be permanently lost. Storing the metadata redundantly on multiple data storage nodes in a metadata store can aid in protecting against data loss due to storage device failure. This redundant storage, however, consumes extra processing and storage resources.

For a large scale distributed storage system, maintaining a transaction log of client interactions with data stored in the system aids in recreating the current state, or a prior state, of the metadata. However, maintaining the transaction logs for a highly accessed large scale distributed storage system consumes extra processing and storage resources. Further, replaying the transaction logs to recreate a current or prior state of the metadata can be slow and consume additional processing resources.

Further, maintaining the metadata can consume valuable processing and storage resources, especially when considering the scale of today's storage systems. For instance, to allow users and administrators to better understand and manage their data files, a large amount of metadata, and thus storage resources, may be necessary to provide for effective searches over the metadata. With increased scale and storage resources consumed by metadata, time to access the metadata store for processing metadata queries and searches is unavoidably increased.

SUMMARY

In view of the problems associated with managing large databases, such as key-value stores, in a storage system, one object of the present disclosure is to provide a highly accessible read cache for the database. This provides for quickly assessing the current status of the database. The read cache may be created based on transaction log entries. To provide a failsafe recovery mechanism, the created read cache may be duplicated to a separate node in the storage system.

Another object of the present disclosure is to migrate some information, such as transaction log data, from a local, fast access storage system to a more robust and cost-effective system, such as an additional secondary storage system, or even a distributed storage system. This migration minimizes the amount of data on fast, local storage for more efficient accessing and processing without affecting the primary functions of the database.

Still another object of the present disclosure is to generate a snapshot for a read cache. By generating a snapshot of a read cache, the covered transaction log may be intentionally deleted to save storage space in the storage system. In case of a read cache failure, instead of using the whole transaction log, a snapshot of the read cache may be used to replace the covered transaction log entries in replaying the read cache. Through this approach, a potentially large amount of storage may be saved.

These and other objects of the present disclosure may be implemented in a metadata store, which is further described below with a brief description of example system components and steps to accomplish the above and other objects for efficiently accessing and processing metadata. However, the techniques introduced herein may be implemented with various storage system structures and database content.

The techniques introduced herein may include a method including: receiving a request from a client to perform a data transaction, updating a key-value pair in a metadata store based on the request, entering the data transaction in a transaction log, updating a read cache with the key-value pair, and replicating the last transaction log entry in at least one other storage node in the metadata store. Other aspects include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

The techniques introduced herein may further include one or more of the following features. The method where the data transaction includes storing, retrieving, updating, or deleting a data object. The method further includes copying a portion of the transaction log to a transaction log fragment object in a large scale distributed storage system. The method where replicating the last transaction log entry includes copying a portion of the last transaction log entry to at least one other storage node such that if one storage node fails there remains enough of the read cache on the remaining storage nodes to fully restore the read cache. The method further includes updating a read cache on a local storage device. The method further includes replicating the read cache on at least one additional local storage device. The method where the at least one additional local storage device includes solid-state drives. The method where replicating the read cache includes copying a portion of the read cache to the local storage devices such that if one local storage device fails there remains on the other local storage devices enough of the read cache to fully restore the read cache. The method where the last transaction log entry is stored on a local storage device. The method where the local storage device includes a hard disk drive.

The techniques introduced herein include a system having: a communication bus; a network interface module communicatively coupled to the communication bus; a storage interface module coupled to a storage device, the storage interface module communicatively coupled to the communication bus; a processor; and a memory module communicatively coupled to the communication bus, the memory module including instructions that when executed by the processor cause the system to receive a request from a client to perform a data transaction. The instructions further cause the processor to update a key-value pair in a metadata store based on the request. The instructions may also cause the processor to enter the data transaction in a transaction log. The instructions further cause the processor to update a read cache with the key-value pair. The instructions also cause the processor to replicate the last transaction log entry in at least one other storage node in the metadata store.

BRIEF DESCRIPTION OF THE DRAWINGS

The techniques introduced in the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals are used to refer to similar elements.

FIG. 1 is a block diagram illustrating an example of a multi-node metadata store with a transaction log.

FIG. 2 is a block diagram illustrating an example of a data storage device capable of storing metadata and a transaction log.

FIG. 3 is a block diagram illustrating an example of a multi-node metadata store with a read cache and part of a transaction log stored on local storage nodes and the rest of the transaction log stored on a distributed storage system.

FIG. 4 is a block diagram illustrating an example of a multi-node metadata store with a read cache stored on local nodes and a transaction log stored on a distributed storage system.

FIG. 5 is a block diagram illustrating an example of a metadata storage method.

FIG. 6 is a block diagram illustrating an example of a metadata storage method with transaction log storage optimizations.

FIG. 7 is a block diagram illustrating an example of a metadata store using a read cache snapshot to reduce the size of the transaction log.

DETAILED DESCRIPTION

For purposes of illustration, the techniques described herein are presented within the context of metadata stores. In particular, the techniques described herein make reference to metadata stores for a large scale distributed storage system. However, references to, and illustrations of, such environments and embodiments are strictly used as examples and are not intended to limit the mechanisms described to the specific examples provided. Indeed, the techniques described are equally applicable to any database using a transaction-like replication mechanism, or any system with state transactions and an associated read cache.

According to the techniques disclosed herein, a read cache comprises the current state of the metadata store. Maintaining a highly accessible read cache can make accessing the current state of the metadata much faster and more efficient than replaying the transaction log. A loss of the read cache due to storage device or node failure, or some other cause of data corruption or loss, could require the replaying of the transaction log from the beginning, or from some other known state, in order to recreate the current state of the metadata. Duplicating the metadata read cache across multiple data storage nodes can mitigate the risk of loss of the current state of the metadata at the cost of additional processing and storage resources.

FIG. 1 is a block diagram illustrating an example of a multi-node metadata store 102 coupled with a large scale distributed storage system 114. The metadata store 102 may be accessed by a client 103. The metadata store comprises a read cache 110 and a transaction log (“TLOG”) 112. In the example of FIG. 1, the metadata store comprises at least three storage nodes 104, 106, and 108. However, there may be more than three data storage nodes as indicated by the ellipses between “node 1” 106 and “node n” 108. The read cache 110 a, 110 b, and 110 n (also referred to herein individually and collectively as read cache 110) caches the current state of the metadata. In the example of FIG. 1, 110 n denotes the nth copy of the read cache, corresponding to the nth storage node. The number of nodes “n” may be any number greater than or equal to two. The TLOG 112 a, 112 b, and 112 n (also referred to herein individually and collectively as TLOG 112) stores metadata associated with client storage requests. Each storage request from the client 103 is logged in the TLOG 112. The TLOG 112 thus comprises the sequence of storage requests made to the large scale distributed storage system 114. This TLOG 112 provides a failsafe recovery mechanism as the TLOG 112 can be replayed in order to determine the most current state of the metadata store 102. The metadata store 102 interfaces with and is communicatively coupled to a large scale distributed storage system 114. The large scale distributed storage system 114 stores data objects, and the records in the metadata store 102 include a mapping of object identifiers (IDs) to respective object locations in the large scale distributed storage system 114.

In one embodiment, the metadata store 102 is a key-value store in which, for every key of a data object, data for retrieval of the data object are stored. The key may be the name, object ID, or other identifier of the data object, and the data may be a list of the storage nodes on which redundantly encoded sub blocks of the data object are stored and available. It should be apparent that other database structures (e.g., a relational database) may be used to implement the metadata store.
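By way of a non-limiting illustration, the key-value layout described above may be sketched as follows. The class names, field names, key string, and node identifiers are hypothetical and are chosen only for this sketch; the disclosure does not prescribe any particular record format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ObjectRecord:
    """Hypothetical value stored under the key of a data object."""
    object_id: str
    # Storage nodes on which redundantly encoded sub blocks are available.
    sub_block_nodes: List[str] = field(default_factory=list)


class KeyValueMetadataStore:
    """In-memory stand-in for the key-value metadata store (assumption)."""

    def __init__(self) -> None:
        self._records: Dict[str, ObjectRecord] = {}

    def put(self, key: str, record: ObjectRecord) -> None:
        self._records[key] = record

    def locate(self, key: str) -> List[str]:
        """Return the storage nodes from which the object can be retrieved."""
        return list(self._records[key].sub_block_nodes)


store = KeyValueMetadataStore()
store.put("photos/cat.jpg", ObjectRecord("obj-42", ["node-3", "node-7", "node-9"]))
nodes = store.locate("photos/cat.jpg")
```

Here the key is the object name and the value carries the object ID together with the list of storage nodes, mirroring the mapping of object identifiers to object locations described with reference to FIG. 1.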

The metadata store 102 may be replicated on a group of storage nodes. In one embodiment, as depicted in the example of FIG. 1, three storage nodes 104, 106, and 108 cooperate in a majority vote configuration such as Paxos. A sharded version of such a metadata store could be distributed among a plurality of such node clusters.

As provided by the techniques introduced herein, the costs of data loss may be mitigated by only duplicating the TLOG 112 or read cache 110 across some of the available nodes in the metadata store 102. For example, the TLOG 112 and read cache 110 may be duplicated across a majority of the data storage nodes in the metadata store 102. In one embodiment, the speed of access to the read cache 110 and the most relevant portion of the TLOG 112 can be addressed by storing the read cache 110 and the tail (e.g., the most recent portion) of the TLOG 112 on fast storage. For example, the nodes of the metadata store 102 may include fast, but perhaps expensive and less-durable, local solid-state drives (“SSDs”) for storing the read cache 110 and portions of the TLOG 112. Solid-state drives can provide faster data access relative to spinning platter hard disk drives. However, some SSDs operate such that each storage location has a relatively limited number of write-cycles before that location on the SSD wears out.

In one embodiment, when there are three nodes in the metadata store as in the example of FIG. 1, the read cache 110 is only duplicated on a majority of the data storage nodes in the metadata store 102, such as node 0 104 and node 1 106, and the read cache 110 is not duplicated on additional nodes, such as node n 108. By occasionally rotating which nodes contain copies of the read cache 110, the wear across the fast, local drives in the metadata store 102 is spread more evenly across the drives in the array. By only writing the read cache 110 to a subset of the nodes, such as nodes 104 and 106, the number of write operations to the fast, local storage such as an SSD is reduced by as much as one third and can thus prolong the life of the array of fast, local storage (which may be SSDs) in the metadata store 102. The reduced number of read caches is still able to meet the consensus requirement for the majority voting algorithm of a Paxos cluster, for example, and so the consistency of the read cache 110 in the nodes 104, 106 and 108 is guaranteed. In one embodiment, where the time to rebuild the read cache 110 by replaying the TLOG 112 is sufficiently low, even a single copy of the read cache (e.g., read cache 110 a) on a single node of the cluster could suffice to satisfy the quorum. If the single copy of the read cache 110 were to fail, the read cache 110 could be restored on a still-functioning node of the cluster by replaying the TLOG 112 on the still-functioning node.
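The rotation of read cache copies across a majority of nodes may be illustrated with the following minimal sketch. The function name, the node names, and the notion of an integer rotation "epoch" are assumptions made for this sketch only; the disclosure does not specify how or when the rotation is triggered.

```python
def cache_holders(nodes, epoch):
    """Pick a majority of nodes to hold the read cache during one epoch.

    Advancing the starting offset with each epoch spreads the write load
    over all of the fast, local drives over time (hypothetical policy).
    """
    majority = len(nodes) // 2 + 1
    start = epoch % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(majority)]


nodes = ["node0", "node1", "node2"]
schedule = [cache_holders(nodes, epoch) for epoch in range(3)]
# schedule == [["node0", "node1"], ["node1", "node2"], ["node2", "node0"]]
```

With three nodes, each node holds the read cache in two of every three epochs, which corresponds to the reduction in read cache writes of as much as one third per drive noted above.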

While node 1 106 is labeled as the master node in FIG. 1, it should be noted that a master node is not necessarily always linked to node 1 106. A master node is dynamic, and the role can be taken over by any other node in the system 100 if that node satisfies the requirements to become a master node and the majority of the nodes in the system 100 agree that the node should become a master node.

FIG. 2 illustrates an example of a computing device 200 capable of storing metadata and a transaction log for use as a node in a metadata store, for example. The illustrated computing device 200 enables the software and hardware modules of the device to communicate with each other and store and execute data and code. The computing device 200 comprises a central data bus 220. A network interface module 202 connects to the central data bus 220 and allows other computing devices (e.g., the client 103) to interact with the computing device 200. The storage interface module 204 allows the computing device 200 to communicate with storage 206. Storage 206 can be non-volatile memory or similar permanent storage device and media, for example, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device known in the art for storing information on a more permanent basis. In some embodiments, storage 206 may comprise a fast storage device 208 such as an SSD, and a more robust, larger capacity, less expensive storage device 210 such as an HDD. However, alternate embodiments may be configured to use only SSDs or only HDDs for local storage. The storage 206 may comprise other local storage device types, or even non-local storage, such as a large scale distributed storage system 114. The central data bus 220 also communicatively couples the network interface module 202 and the storage interface module 204 to a processor 212 and memory 214. The memory 214 may comprise a database manager 216 in some embodiments.

The processor 212 can include an arithmetic logic unit, a microprocessor, a general-purpose controller or some other processor array to perform computations. The processor 212 is coupled to the central data bus 220 for communication with the other components of the system 200. Although only a single processor is shown in FIG. 2, multiple processors or processing cores may be included.

The memory 214 can store instructions and/or data that may be executed by processor 212. The memory 214 is coupled to the central data bus 220 for communication with the other components. The instructions and/or data may include code for performing the techniques described herein. The memory 214 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory or some other memory device known in the art.

FIG. 3 illustrates an example of a multi-node metadata store 102 with a read cache 110 and part of a transaction log 112 stored on local storage nodes and the rest of the transaction log stored on a distributed storage system. In some embodiments, the metadata store 102 comprises a fast, local storage array 301 and a more robust secondary storage array 302. The system enables robust metadata storage in an efficient manner which can reduce processing and storage requirements while providing data protection or recovery in the event of partial system failure. The system 300 comprises a metadata store 102 communicatively coupled to a large scale distributed storage system 114.

The client 103 may initiate transactions with the large scale distributed storage system 114. These transactions may alter the read cache 110 a, 110 b, and 110 n (also referred to herein individually and collectively as 110). Client requests are logged in the TLOG 112. In some embodiments, the read cache 110 is stored on one or more nodes of fast local storage 301, which in some embodiments may be local SSDs. The most recent client transactions are stored in the tail of the TLOG 320 a, 320 b, and 320 n (also referred to herein individually and collectively as 320). Like the read cache, the TLOG tail may be stored on a number of nodes, designated by the number “n”, and duplicated to additional nodes by copying 321 the transaction log from one node to another.

Other segments of the TLOG 322, 324, and 326 (distributed across nodes a, b, n) may be stored, in some embodiments, on secondary local storage nodes 304, 306 and 308 in the secondary local storage array 302, which in some embodiments may be local HDDs. Segments of the TLOG may be copied 323 to parallel storage nodes.

Still other segments of the TLOG 342-348 may be stored as one or more data objects 340 in a large scale distributed storage system 114.

When the most recent TLOG entries in TLOG.tail 320 meet a certain threshold, they may be migrated to the secondary local storage nodes 304, 306, 308. The triggering threshold may be a time-based trigger, or the migration may be triggered when the TLOG.tail grows beyond a predetermined storage size (10 MB, for example). By moving the TLOG.tail to a secondary node 304, 306, 308 and designating the TLOG.tail as a new TLOG element 322 a, an ordered sequence of TLOG files (TLOG.i+2 322 a, TLOG.i+1 324 a, TLOG.i 326 a, etc.) is accumulated on the secondary local storage.
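The size-based and time-based triggers for migrating the TLOG.tail may be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class `TlogTail`, the function `roll`, the `max_age_s` parameter, and the use of a Python list to stand in for secondary local storage are all hypothetical, and only the 10 MB example size is taken from the description above.

```python
import time


class TlogTail:
    """Hypothetical in-memory stand-in for TLOG.tail on fast local storage."""

    def __init__(self, max_bytes=10 * 1024 * 1024, max_age_s=None,
                 clock=time.monotonic):
        self.max_bytes = max_bytes    # size threshold (10 MB example)
        self.max_age_s = max_age_s    # optional time-based threshold
        self.clock = clock
        self.opened_at = clock()
        self.entries = []
        self.size = 0

    def append(self, entry: bytes) -> None:
        self.entries.append(entry)
        self.size += len(entry)

    def should_roll(self) -> bool:
        """True once either the size or the age threshold is met."""
        if self.size > self.max_bytes:
            return True
        if self.max_age_s is not None:
            return self.clock() - self.opened_at >= self.max_age_s
        return False


def roll(tail, secondary_segments):
    """Move the tail to secondary storage as the newest TLOG segment
    and return a fresh, empty tail with the same thresholds."""
    secondary_segments.append(list(tail.entries))
    return TlogTail(tail.max_bytes, tail.max_age_s, tail.clock)


secondary = []                     # ordered TLOG segments, oldest first
tail = TlogTail(max_bytes=8)       # tiny threshold so the example rolls
for entry in (b"12345", b"6789", b"ab"):
    tail.append(entry)
    if tail.should_roll():
        tail = roll(tail, secondary)
```

In this sketch each rolled tail becomes the next element of an ordered list, which plays the role of the sequence TLOG.i, TLOG.i+1, TLOG.i+2 accumulated on the secondary local storage.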

In some embodiments, in response to a threshold being satisfied, the read cache may be copied as a read cache snapshot, for instance Read Cache Snapshot.i 310, to secondary local storage nodes 304, 306, and/or 308 and all other TLOG entries may be removed. Examples of such a threshold may include a time limit, a number of TLOG entries or a size threshold on the TLOG. After the read cache snapshot is created, subsequent TLOG entries may then be added to a new TLOG. Subsequently, when the read cache needs to be restored, the restoration will take less time because it begins with the read cache snapshot and applies only the subsequent modifications from the new, smaller TLOG. In this embodiment, the storage capacity requirement of the secondary local storage nodes 304, 306, and 308 that hold the TLOG can be reduced.

As the limited nodes of the secondary storage of the metadata store approach capacity (or when a counter or time-based threshold is reached), the oldest segments of the TLOG, 326 and Read Cache Snapshot.i 310, for example, can be migrated 330 to a more robust and more cost-effective storage system such as the large scale distributed storage system 114, as illustrated. Through the migration 330, the plurality of replicas of the TLOG entries 320, 322, 324, 326, and Read Cache Snapshot.i 310 in the metadata store 102 are replaced by a single entry in the large scale distributed storage system 114, which capitalizes on the robustness and efficiency of the large scale distributed storage system with its lower storage overhead and higher redundancy level. However, alternative embodiments for a remote storage system, such as Network-Attached Storage or RAID arrays, are also possible. In the event of a failure of an element of the metadata store, archived TLOG entries remain accessible by means of the reference to the Read Cache Snapshot.i 310 a at the end of the sequence of TLOG files in the metadata store 102 or the Read Cache Snapshot.i-1 312 in the large scale distributed storage system 114.

In one embodiment, TLOG files are migrated 330 from the metadata store to the large scale distributed storage system 114 after all nodes of the metadata store are in sync for these TLOG files. For example, TLOG.i+2 322 a is first synchronized to node 1 306 and all remaining nodes before the migration to the large scale distributed storage system 114 can take place.
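The condition that TLOG files are migrated only after all nodes are in sync may be illustrated with the sketch below. The function names, the integer segment identifiers, the "tlog/<n>" object naming, and the use of Python dictionaries to stand in for the nodes and for the large scale distributed storage system are assumptions made for illustration only.

```python
def synced_segments(node_segments):
    """Return the TLOG segment ids that every node currently holds."""
    held = [set(segments) for segments in node_segments.values()]
    return sorted(set.intersection(*held))


def migrate(node_segments, payloads, object_store):
    """Replace the per-node replicas of each fully synchronized segment
    with a single object in the distributed storage system."""
    moved = synced_segments(node_segments)
    for seg in moved:
        object_store["tlog/%d" % seg] = payloads[seg]   # single entry
        for segments in node_segments.values():
            segments.discard(seg)                       # drop local replicas
    return moved


# Segment 3 has not yet reached node2, so it must stay local for now.
node_segments = {"node0": {1, 2, 3}, "node1": {1, 2, 3}, "node2": {1, 2}}
payloads = {1: b"segment-1", 2: b"segment-2", 3: b"segment-3"}
object_store = {}
moved = migrate(node_segments, payloads, object_store)
```

After the call, segments 1 and 2 exist once in the object store and on no node, while segment 3 remains on node0 and node1 until node2 has synchronized it.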

FIG. 4 illustrates an example 400 of a multi-node metadata store with a read cache stored on local nodes and a transaction log stored on a distributed storage system. In this embodiment, the metadata store 102 comprises the read cache 110, and the large scale distributed storage system 114 comprises the TLOG entries TLOG.i-1 342, TLOG.i-2 344, TLOG.i-3 346, . . . , and Read Cache Snapshot.i-1 312. The metadata store 102 comprises multiple local storage nodes 104, 106 and 108 storing multiple copies of a metadata read cache 110. In this embodiment, the entire TLOG 342-348 and Read Cache Snapshot.i-1 312 are stored in a large scale distributed storage system as objects 340.

FIG. 5 illustrates a method 500 for updating a metadata store according to some implementations of the present disclosure. At 501 the system receives a request from a client to store or retrieve a data object. The method may allow other operations on the data object, as well, such as updating or deleting the data object. The data object may be an object stored in the large scale distributed storage system 114 or a key-value pair in the metadata store 102. At 502 the system updates a key-value pair in a metadata store 102 based on the request. At 504 the system enters a record of the data transaction in a transaction log 112. In some embodiments, the transaction log 112 may be a linked list of transaction log entries. At 506 the system replicates the data transaction entered into the transaction log (TLOG.tail) in at least one other node in the metadata store. By having a copy of the data entries of the TLOG.tail across storage nodes, the system may be more robust, especially in situations where a storage node or device fails, and the data entries of the TLOG.tail are duplicated on another node and can be accessed and fully restored from that node. At 508 the system updates a read cache 110 with the key-value pair. In some embodiments, the read cache 110 represents the current state of the metadata store 102. In such embodiments, the read cache 110 may be recreated by “replaying” the transactions from the transaction log 112. The presence of a read cache 110 that represents the current state of the metadata store 102 allows for faster access to the metadata.
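The steps 501 through 508 of method 500, together with the replaying of the transaction log, may be sketched as follows. The classes `Node` and `MetadataCluster`, the tuple format of a TLOG entry, and the choice to replicate each entry to every other node (the method requires at least one) are assumptions of this sketch rather than features taken from the disclosure.

```python
def apply_entry(state, entry):
    """Apply one TLOG entry (op, key, value) to a key-value state."""
    op, key, value = entry
    if op == "delete":
        state.pop(key, None)
    else:                       # "store" and "update" both set the key
        state[key] = value


class Node:
    def __init__(self, name, holds_cache):
        self.name = name
        self.tlog = []                                  # local TLOG copy
        self.read_cache = {} if holds_cache else None   # cache on a subset


class MetadataCluster:
    def __init__(self, nodes):
        self.nodes = nodes
        self.master = nodes[0]
        self.kv = {}            # the key-value pairs of the metadata store

    def handle(self, op, key, value=None):
        entry = (op, key, value)                 # 501: client request
        apply_entry(self.kv, entry)              # 502: update key-value pair
        self.master.tlog.append(entry)           # 504: enter in TLOG
        for node in self.nodes:                  # 506: replicate the entry
            if node is not self.master:
                node.tlog.append(entry)
        for node in self.nodes:                  # 508: update read caches
            if node.read_cache is not None:
                apply_entry(node.read_cache, entry)


def replay(tlog):
    """Recreate the current state by replaying a TLOG from the start."""
    state = {}
    for entry in tlog:
        apply_entry(state, entry)
    return state


nodes = [Node("node0", True), Node("node1", True), Node("node2", False)]
cluster = MetadataCluster(nodes)
cluster.handle("store", "a", 1)
cluster.handle("store", "b", 5)
cluster.handle("update", "a", 2)
cluster.handle("delete", "b")
rebuilt = replay(nodes[2].tlog)   # node2 holds no cache, only the TLOG
```

Replaying the TLOG copy held by node2 yields the same state as the read cache on node0, which is the property that allows a lost read cache to be recreated on a still-functioning node.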

FIG. 6 illustrates a method 600 for updating a metadata store according to some implementations of the present disclosure. At 601 the system receives a request from a client to store, retrieve, update or delete an object. The object may be a large scale distributed storage system object or a key-value pair in the metadata store. At 602 the system updates a key-value pair in a metadata store based on the request. At 604 the system enters the data transaction in a transaction log (TLOG) on a local hard disk drive. Hard disk drives represent a fast, local storage device which may not be as fast as a solid-state drive, but may be more cost-effective. At 606 the system replicates the data transaction entered into the transaction log (TLOG.tail) in at least one other local hard disk drive node. There may be similar considerations in the storage strategy for the TLOG.tail as there are for the read cache, as mentioned previously, such as robustness by having duplicate versions, space savings by only duplicating the data entries of the TLOG.tail across some of the storage nodes, or storing a portion of the data entries of the TLOG.tail across multiple storage nodes such that the failure of a node leaves enough of the TLOG.tail intact on the remaining storage nodes. In some embodiments, other storage options may fill the role of the local hard disk drive storage. At 608 the system updates the read cache on each node that contains the read cache, which in some embodiments is stored on a local solid-state drive. In other embodiments, other methods of fast, local storage may be used in place of solid-state drives. At 610 the system copies the local portion of the transaction log from the local hard disk drive to a transaction log fragment object in a distributed storage system. The distributed storage system may be distributed over multiple nodes and multiple physical site locations for ensuring robust data storage that is easily recoverable in the event of failure of one or more storage nodes, and even the inability to access some of the physical sites where the distributed storage system devices are located.

FIG. 7 illustrates an example 700 of a metadata store using a read cache snapshot to reduce the size of the transaction log. At “Time 1” the metadata store has a read cache 110 and a TLOG 112. The read cache 110 is copied 701 as a read cache snapshot 710 so that the current state of the read cache 110 (which incorporates all of the state represented by the TLOG 112) is captured in the read cache snapshot 710. Then, a short time later at “Time 2” the TLOG 112 is no longer needed since the current state of the read cache is captured in the read cache snapshot 710, and the TLOG 112 is deleted 705. Then, at “Time 3” the old TLOG 112 no longer exists, and the read cache snapshot 710 serves as the new baseline on which the new TLOG 712 can build, capturing the transactions that occur subsequent to creating 701 the read cache snapshot 710. At this point in time, if the read cache 110 were lost, it could be rebuilt by starting with the read cache snapshot 710 as a baseline and replaying the new TLOG 712 on top of that baseline. This process should be much faster than replaying the entire TLOG 112 to regenerate the read cache 110. Additionally, since the TLOG 112 can grow quite large over time, the process of creating 701 the read cache snapshot 710 and deleting 705 the potentially very large TLOG 112 can save a potentially very large amount of storage.
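The sequence of “Time 1” through “Time 3” may be sketched as follows. The class `SnapshottingStore` and its method names are hypothetical, and the in-memory dictionaries and lists merely stand in for the read cache 110, the read cache snapshot 710, and the TLOGs 112 and 712.

```python
import copy


class SnapshottingStore:
    """Hypothetical store that truncates its TLOG after each snapshot."""

    def __init__(self):
        self.read_cache = {}    # current state (read cache 110)
        self.snapshot = {}      # baseline (read cache snapshot 710)
        self.tlog = []          # entries recorded since the last snapshot

    def put(self, key, value):
        self.tlog.append((key, value))
        self.read_cache[key] = value

    def take_snapshot(self):
        self.snapshot = copy.deepcopy(self.read_cache)  # copy 701
        self.tlog = []          # delete 705 old TLOG; start new TLOG 712

    def rebuild_cache(self):
        """Rebuild a lost read cache from the snapshot plus the new TLOG."""
        state = copy.deepcopy(self.snapshot)
        for key, value in self.tlog:
            state[key] = value
        return state


store = SnapshottingStore()
for i in range(1000):
    store.put("key-%d" % (i % 10), i)    # long history, only ten keys
store.take_snapshot()
store.put("key-3", "after-snapshot")
entries_to_replay = len(store.tlog)
rebuilt = store.rebuild_cache()
```

In this sketch a rebuild replays a single entry on top of the snapshot rather than the full history of 1001 entries, and the 1000 entries covered by the snapshot no longer occupy storage.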

In the preceding description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosure. It will be apparent, however, that the disclosure can be practiced without these specific details. In other instances, structures and devices have been shown in block diagram form in order to avoid obscuring the disclosure. For example, the present disclosure has been described in some implementations above with reference to user interfaces and particular hardware. However, the present disclosure applies to any type of computing device that can receive data and commands, and any devices providing services. Reference in the specification to “one implementation” or “an implementation” means that a particular feature, structure, or characteristic described in connection with the implementation is included in at least one implementation of the disclosure. The appearances of the phrase “in one implementation” or “in some implementations” in various places in the specification are not necessarily all referring to the same implementation.

Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers or the like.

It should be borne in mind, however, that these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other information storage, transmission or display devices.

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the disclosure as described herein.

Finally, the foregoing description of the implementations of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the present disclosure to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the present disclosure be limited not by this detailed description, but rather by the claims of this application. As will be understood by those familiar with the art, the present disclosure may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Likewise, the particular naming and division of the modules, routines, features, attributes, methodologies and other aspects are not mandatory or significant, and the mechanisms that implement the present disclosure or its features may have different names, divisions and/or formats. Furthermore, as will be apparent to one of ordinary skill in the relevant art, the modules, routines, features, attributes, methodologies and other aspects of the present disclosure can be implemented as software, hardware, firmware or any combination of the three. Also, wherever a component, an example of which is a module, of the present disclosure is implemented as software, the component can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future in the art of computer programming. Additionally, the present disclosure is in no way limited to implementation in any specific programming language, or for any specific operating system or environment. Accordingly, the present disclosure is intended to be illustrative, but not limiting, of the scope of the present disclosure, which is set forth in the following claims.

What is claimed is:
 1. A system comprising: a distributed storage system; and a metadata store coupled with the distributed storage system and comprising: a plurality of data storage nodes, each data storage node comprising: a first data storage medium configured to store a read cache; and a second data storage medium configured to store a transaction log; and a database manager configured to: receive a storage request from a client; create a transaction log entry based on the storage request; replicate the transaction log entry in the transaction log of each of the plurality of data storage nodes; and update a read cache on a subset of the plurality of data storage nodes with a key-value pair based on the storage request.
 2. The system of claim 1, wherein the storage request comprises a request to store, retrieve, update, or delete a data object stored in the distributed storage system.
 3. The system of claim 1, wherein the subset of the plurality of data storage nodes comprises a majority of data storage nodes of the plurality of data storage nodes.
 4. The system of claim 3, wherein the database manager is further configured to update the read cache on the subset of the plurality of data storage nodes prior to the read cache being updated on remaining data storage nodes of the plurality of data storage nodes.
 5. The system of claim 1, wherein the database manager is further configured to: store a snapshot of the read cache from the subset of the plurality of data storage nodes to the second data storage medium on the subset of the plurality of data storage nodes; delete the transaction log from the subset of the plurality of data storage nodes; and create a new transaction log on the subset of the plurality of data storage nodes for storage requests subsequent to storing the snapshot of the read cache.
 6. The system of claim 1, wherein the first data storage medium is a solid-state drive.
 7. The system of claim 1, wherein the second data storage medium is a hard disk drive.
 8. A method comprising: receiving, at a first data storage node of a plurality of data storage nodes in a metadata store, a storage request from a client; updating a read cache on a first data storage medium of a subset of the plurality of data storage nodes with a key-value pair based on the storage request; creating, on a second data storage medium of the first data storage node, a transaction log entry based on the storage request; and replicating the transaction log entry in corresponding transaction logs of each remaining data storage node of the plurality of data storage nodes in the metadata store.
 9. The method of claim 8, wherein the storage request comprises a request to store, retrieve, update, or delete a data object stored in a distributed storage system associated with the metadata store.
 10. The method of claim 8, wherein the subset of the plurality of data storage nodes comprises a majority of data storage nodes of the plurality of data storage nodes.
 11. The method of claim 10, further comprising updating the read cache on the subset of the plurality of data storage nodes prior to the read cache being updated on remaining data storage nodes of the plurality of data storage nodes.
 12. The method of claim 8, further comprising: storing a snapshot of the read cache to the second data storage medium of the first data storage node; deleting the transaction log from the second storage medium of the first data storage node; and creating a new transaction log on the second storage medium of the first data storage node for storage requests subsequent to storing the snapshot of the read cache.
 13. The method of claim 8, wherein the first data storage medium is a solid-state drive.
 14. The method of claim 8, wherein the second data storage medium is a hard disk drive.
 15. A system comprising: means for receiving, at a first data storage node of a plurality of data storage nodes in a metadata store, a storage request from a client; means for updating a read cache, on a first data storage medium of a subset of the plurality of data storage nodes, with a key-value pair based on the storage request; means for creating, on a second data storage medium of the first data storage node, a transaction log entry based on the storage request; means for replicating the transaction log entry in corresponding transaction logs of each remaining data storage node of the plurality of data storage nodes in the metadata store.
 16. The system of claim 15, wherein the storage request comprises a request to store, retrieve, update, or delete a data object stored in a distributed storage system associated with the metadata store.
 17. The system of claim 15, wherein the subset of the plurality of data storage nodes comprises a majority of data storage nodes of the plurality of data storage nodes.
 18. The system of claim 17, further comprising updating the read cache on the subset of the plurality of data storage nodes prior to the read cache being updated on remaining data storage nodes of the plurality of data storage nodes.
 19. The system of claim 15, further comprising: means for storing a snapshot of the read cache to the second data storage medium of the first data storage node; means for deleting the transaction log from the second storage medium of the first data storage node; and means for creating a new transaction log on the second storage medium of the first data storage node for storage requests subsequent to storing the snapshot of the read cache.
 20. The system of claim 15, wherein: the first data storage medium is a solid-state drive; and the second data storage medium is a hard disk drive.
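
By way of illustration only, the write path and the snapshot operation recited in the claims above might be sketched in Python as follows. This is a minimal in-memory sketch under stated assumptions, not the claimed implementation: the names StorageNode, DatabaseManager, handle_request, catch_up, checkpoint and apply_entry are hypothetical, a dict and a list stand in for the first and second data storage media, and taking the first floor(n/2)+1 nodes as the majority subset is an arbitrary choice made for the example.

```python
from dataclasses import dataclass, field


def apply_entry(cache, entry):
    """Apply one transaction log entry (op, key, value) to a read cache."""
    op, key, value = entry
    if op in ("store", "update"):
        cache[key] = value
    elif op == "delete":
        cache.pop(key, None)
    # "retrieve" leaves the cache unchanged.


@dataclass
class StorageNode:
    """Hypothetical node: read cache (first medium), log and snapshot (second medium)."""
    name: str
    read_cache: dict = field(default_factory=dict)
    transaction_log: list = field(default_factory=list)
    snapshot: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries already reflected in read_cache


class DatabaseManager:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def majority(self):
        """Assumed policy: the first floor(n/2)+1 nodes form the subset."""
        return self.nodes[: len(self.nodes) // 2 + 1]

    def catch_up(self, node):
        """Replay log entries that the node's read cache has not seen yet."""
        for entry in node.transaction_log[node.applied:]:
            apply_entry(node.read_cache, entry)
        node.applied = len(node.transaction_log)

    def handle_request(self, op, key, value=None):
        entry = (op, key, value)
        # Replicate the transaction log entry on every node.
        for node in self.nodes:
            node.transaction_log.append(entry)
        # Update the read cache on the majority subset first;
        # the remaining nodes are updated later via catch_up().
        for node in self.majority():
            self.catch_up(node)
        return entry

    def checkpoint(self, node):
        """Snapshot the read cache, discard the old log, start a new one."""
        self.catch_up(node)
        node.snapshot = dict(node.read_cache)
        node.transaction_log = []
        node.applied = 0


nodes = [StorageNode(f"node{i}") for i in range(5)]
manager = DatabaseManager(nodes)
manager.handle_request("store", "object-1", "location-A")
manager.catch_up(nodes[4])    # a remaining node is updated afterwards
manager.checkpoint(nodes[0])  # snapshot, then a fresh transaction log
```

In this sketch, replaying the log in catch_up is one possible way for the remaining nodes to be updated after the majority subset; the claims do not prescribe that mechanism.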